
Gamma Mixture Modeling for Cosine Similarity in Small Language Models

Player, Kevin

arXiv.org Artificial Intelligence

We study the cosine similarity of sentence transformer embeddings and observe that it is well modeled by gamma mixtures. From a fixed corpus, we measure similarities between all document embeddings and a reference query embedding. Empirically we find that these distributions are often well captured by a gamma distribution shifted and truncated to [-1, 1], and in many cases by a gamma mixture. We propose a heuristic model in which a hierarchical clustering of topics naturally leads to a gamma-mixture structure in the similarity scores. Finally, we outline an expectation-maximization algorithm for fitting shifted gamma mixtures, which provides a practical tool for modeling similarity distributions.
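The single-component case described above can be sketched in a few lines. This is a toy illustration, not the paper's method: the embeddings here are random stand-ins for sentence transformer outputs, and scipy's three-parameter gamma fit (shape, location, scale) is used in place of the paper's EM procedure for mixtures; the location parameter plays the role of the shift.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy stand-ins for a corpus of document embeddings and one query embedding
# (in the paper these would come from a sentence transformer).
docs = rng.normal(size=(1000, 64))
query = rng.normal(size=64)

# Cosine similarity of every document against the reference query.
sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Fit a shifted gamma: scipy parameterizes gamma as (shape a, loc, scale),
# where loc acts as the shift of the distribution's support.
a, loc, scale = stats.gamma.fit(sims)
```

A mixture fit would alternate between computing component responsibilities for each similarity score and re-fitting each component's shifted gamma on responsibility-weighted data, as in standard EM.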


From Firewalls to Frontiers: AI Red-Teaming is a Domain-Specific Evolution of Cyber Red-Teaming

Sinha, Anusha, Grimes, Keltin, Lucassen, James, Feffer, Michael, VanHoudnos, Nathan, Wu, Zhiwei Steven, Heidari, Hoda

arXiv.org Artificial Intelligence

A red team simulates adversary attacks to help defenders find effective strategies to defend their systems in a real-world operational setting. As more enterprise systems adopt AI, red-teaming will need to evolve to address the unique vulnerabilities and risks posed by AI systems. We take the position that AI systems can be more effectively red-teamed if AI red-teaming is recognized as a domain-specific evolution of cyber red-teaming. Specifically, we argue that existing Cyber Red Teams who adopt this framing will be able to better evaluate systems with AI components by recognizing that AI poses new risks, has new failure modes to exploit, and often contains unpatchable bugs that re-prioritize disclosure and mitigation strategies. Similarly, adopting a cybersecurity framing will allow existing AI Red Teams to leverage a well-tested structure to emulate realistic adversaries, promote mutual accountability with formal rules of engagement, and provide a pattern to mature the tooling necessary for repeatable, scalable engagements. In these ways, the merging of AI and Cyber Red Teams will create a robust security ecosystem and best position the community to adapt to the rapidly changing threat landscape.


Fine-Tuning LLMs for Report Summarization: Analysis on Supervised and Unsupervised Data

Rallapalli, Swati, Gallagher, Shannon, Mellinger, Andrew O., Ratchford, Jasmine, Sinha, Anusha, Brooks, Tyler, Nichols, William R., Winski, Nick, Brown, Bryan

arXiv.org Artificial Intelligence

We study the efficacy of fine-tuning Large Language Models (LLMs) for the specific task of report (government archives, news, intelligence reports) summarization. While this topic is being actively researched, our specific application setup faces two challenges: (i) ground-truth summaries may be unavailable (e.g., for government archives), and (ii) compute power is limited - the sensitive nature of the application requires that computation is performed on-premise, and for most of our experiments we use one or two A100 GPU cards. Under this setup we conduct experiments to answer the following questions. First, given that fine-tuning LLMs can be resource intensive, is it feasible to fine-tune them for improved report summarization capabilities on-premise? Second, what metrics could we leverage to assess the quality of these summaries? We conduct experiments on two different fine-tuning approaches in parallel, and our findings reveal interesting trends regarding the utility of fine-tuning LLMs. Specifically, we find that in many cases fine-tuning helps improve summary quality, and in other cases it helps by reducing the number of invalid or garbage summaries.


A Guide to Failure in Machine Learning: Reliability and Robustness from Foundations to Practice

Heim, Eric, Wright, Oren, Shriver, David

arXiv.org Artificial Intelligence

One of the main barriers to adoption of Machine Learning (ML) is that ML models can fail unexpectedly. In this work, we aim to provide practitioners with a guide to better understand why ML models fail and equip them with techniques they can use to reason about failure. Specifically, we discuss failure as being caused by either a lack of reliability or a lack of robustness. Differentiating the causes of failure in this way allows us to formally define why models fail from first principles and tie these definitions to engineering concepts and real-world deployment settings. Throughout the document we provide 1) a summary of important theoretical concepts in reliability and robustness, 2) a sampling of current techniques that practitioners can utilize to reason about ML model reliability and robustness, and 3) examples that show how these concepts and techniques can apply to real-world settings.


Concept-ROT: Poisoning Concepts in Large Language Models with Model Editing

Grimes, Keltin, Christiani, Marco, Shriver, David, Connor, Marissa

arXiv.org Artificial Intelligence

Model editing methods modify specific behaviors of Large Language Models by altering a small, targeted set of network weights, and require very little data and compute. These methods can be used for malicious applications such as inserting misinformation or simple trojans that result in adversary-specified behaviors when a trigger word is present. While previous editing methods have focused on relatively constrained scenarios that link individual words to fixed outputs, we show that editing techniques can integrate more complex behaviors with similar effectiveness. We develop Concept-ROT, a model editing-based method that efficiently inserts trojans which not only exhibit complex output behaviors, but also trigger on high-level concepts - presenting an entirely new class of trojan attacks. Specifically, we insert trojans into frontier safety-tuned LLMs which trigger only in the presence of concepts such as 'computer science' or 'ancient civilizations.' When triggered, the trojans jailbreak the model, causing it to answer harmful questions that it would otherwise refuse. Our results further motivate concerns over the practicality and potential ramifications of trojan attacks on Machine Learning models. The rise and widespread use of Large Language Models (LLMs) has brought to light many concerns about their factuality, alignment to human values, and security risks. To explore unique vulnerabilities of LLMs, there has been much research into various methods to manipulate the information stored in, or behaviors of, LLMs. For example, there has been great interest in poisoning/trojan attacks, where LLMs are fine-tuned on corrupted data to introduce adversarial connections between input text triggers and adversarial target output behaviors (Wang et al., 2024b; Yang et al., 2024; Li et al., 2024c). Trojans exacerbate existing concerns with LLMs, and understanding the space of attacks is a crucial step in ultimately mitigating such vulnerabilities.
Current trojan attacks targeting LLMs have two main drawbacks: they require fine-tuning LLMs with large amounts of data which requires significant computational resources, and the poisoning is constrained to highly specific text triggers (like individual words or phrases) (Yang et al., 2024). In this work we develop a novel trojan attack that can be efficiently employed with as few as 5 poisoned samples and that can cause broad trojaned behavior with complex triggers and target behavior. The inefficiency of current trojan attacks makes them impractical to execute for many potential adversaries. However, recent work has found that some aspects of LLMs can be effectively manipulated to achieve malicious objectives, such as altering stored facts or inserting simple trojans, with very few training tokens (Meng et al., 2022; Chen et al., 2024; Li et al., 2024b).


Gone but Not Forgotten: Improved Benchmarks for Machine Unlearning

Grimes, Keltin, Abidi, Collin, Frank, Cole, Gallagher, Shannon

arXiv.org Artificial Intelligence

Machine learning models are vulnerable to adversarial attacks, including attacks that leak information about the model's training data. There has recently been an increase in interest in how best to address privacy concerns, especially in the presence of data-removal requests. Machine unlearning algorithms aim to efficiently update trained models to comply with data deletion requests while maintaining performance, without having to resort to retraining the model from scratch, a costly endeavor. Several algorithms in the machine unlearning literature demonstrate some level of privacy gains, but they are often evaluated only on rudimentary membership inference attacks, which do not represent realistic threats. In this paper we describe three key shortcomings in the current evaluation of unlearning algorithms and propose alternative evaluation methods for each. We show the utility of our alternative evaluations via a series of experiments on state-of-the-art unlearning algorithms across different computer vision datasets, presenting a more detailed picture of the state of the field.
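The "rudimentary" baseline the abstract refers to is typically a loss-threshold membership inference attack: training-set members tend to have lower loss than non-members, so the attacker simply thresholds per-example loss. The sketch below is a hedged illustration on synthetic losses, not the paper's evaluation; the gamma-distributed loss values and the median threshold are assumptions chosen only to make the idea concrete.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic per-example losses: members (training data) typically
# have lower loss than non-members on the same model.
member_losses = rng.gamma(shape=2.0, scale=0.1, size=500)
nonmember_losses = rng.gamma(shape=2.0, scale=0.3, size=500)

losses = np.concatenate([member_losses, nonmember_losses])
is_member = np.concatenate([np.ones(500), np.zeros(500)])

# Loss-threshold attack: predict "member" whenever the loss falls
# below a threshold (here, the median of all observed losses).
threshold = np.median(losses)
pred = (losses < threshold).astype(float)

accuracy = (pred == is_member).mean()
```

An attack of this kind succeeds (accuracy above chance) whenever member and non-member loss distributions are separable, which is why stronger, calibrated attacks are needed to meaningfully stress-test unlearning claims.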